Hand-crafted versus Machine-learned Inflectional Rules: The Euroling-SiteSeeker Stemmer and CST's Lemmatiser
نویسندگان
چکیده
The Euroling stemmer is developed for a commercial web site and intranet search engine called SiteSeeker. SiteSeeker is basically used in the Swedish domain but to some extent also for the English domain. CST’s lemmatiser comes from the Center for Language Technology, University of Copenhagen and was originally developed as a research prototype to create lemmatisation rules from training data. In this paper we compare the performance of the stemmer that uses handcrafted rules for Swedish, Danish and Norwegian as well one stemmer for Greek with CST’s lemmatiser that uses training data to extract lemmatisation rules for Swedish, Danish, Norwegian and Greek. The performance of the two approaches are about the same with around 10 percent errors. The handcrafted rule based stemmer techniques are easy to get started with if the programmer has the proper linguistic knowledge. The machine trained sets of lemmatisation rules are very easy to produce without having linguistic knowledge given that one has correct training data.
منابع مشابه
Experiments to Investigate the Connection between Case Distribution and Topical Relevance of Search Terms in an Information Retrieval Setting
We have performed a set of experiments made to investigate the utility of morphological analysis to improve retrieval of documents written in languages with relatively large morphological variation in a practical commercial setting, using the SiteSeeker search system developed and marketed by Euroling AB. The objective of the experiments was to evaluate different lemmatisers and stemmers to det...
متن کاملPoor man's lemmatisation for automatic error classification
This paper demonstrates the possibility to make an existing automatic error classifier for machine translations independent from the requirement of lemmatisation. This makes it usable also for smaller and under-resourced languages and in situations where there is no lemmatiser at hand. It is shown that cutting all words into the first four letters is the best method even for highly inflective l...
متن کاملHybrid Stemmer for Gujarati
In this paper we present a lightweight stemmer for Gujarati using a hybrid approach. Instead of using a completely unsupervised approach, we have harnessed linguistic knowledge in the form of a hand-crafted Gujarati suffix list in order to improve the quality of the stems and suffixes learnt during the training phase. We used the EMILLE corpus for training and evaluating the stemmer’s performan...
متن کاملCombining Hand-crafted Rules and Unsupervised Learning in Constraint-based Morphological Disambiguation
This paper presents a constraint-based morphological disambiguation approach that is applicable languages with complex morphology-specifically agglutinative languages with productive inflectional and derivational morphological phenomena. In certain respects, our approach has been motivated by Brill's recent work (Brill, 1995b), but with the observation that his transformational approach is not ...
متن کاملBulStem: Design and Evaluation of Inflectional Stemmer for Bulgarian
The paper starts with an overview of some important approaches to stemming for English and other languages. Then, the design, implementation and evaluation of BulStem – a freely available inflectional stemmer for Bulgarian, are presented. The problem is addressed from a machine-learning perspective using a large morphological dictionary. A detailed automatic evaluation in terms of under-stemmin...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2006